ABSTRACT 

A random group of 49 items was drawn from nine 
commercially available reading comprehension tests. Each item was 
classified independently by two judges as either a measure of the 
ability to find answers to questions answered explicitly or in 
paraphrase in the passages, a measure of the ability to draw 
inferences or deductions, or a measure of some "other" skill. Both 
judges classified a majority of the items as measures of the ability 
to draw inferences or deductions, and there was a reasonable amount 
of agreement between the judges in this classification process. The 
judges also indicated the extent to which they thought seven types of 
faults were present in each item. One judge found a total of 122 
faults in the 49 items; the other judge found 31. The judges were 
most often in agreement in judging items to be measures of general 
knowledge rather than measures of the ability to comprehend specific 
passages. (Author) 
Subjective Evaluation of the Quality of Standardized Reading-Comprehension 
Test Items 

Fred Pyrczak 

California State University, Los Angeles 

Writing multiple-choice items of high quality requires considerable 
insight into the content and intellectual skills that are to be measured, 
the desirable characteristics and limitations of multiple-choice items, 
and the probable reactions of examinees to the items. Because 
item-writing is a complicated skill, it is not surprising that a relatively 
large number of faulty items in standardized tests have been identified 
by subject-matter specialists and scholars (e.g., Hoffmann, 1962). The 
basic purpose of the present study was to determine the extent to which 
faults are present in the items in a specific set of standardized 
reading-comprehension tests. The subjective analysis of item quality conducted 
in this study differed from earlier analyses in three important respects: 
a sample of items was drawn systematically for analysis in this study, 
two judges independently rated the quality of each item, and both judges 
used the same rating scale when evaluating each item. 

PROCEDURES 

Sample. A set of nine standardized reading-comprehension tests, which 
are listed in the Test Reference List at the end of this paper, was 
selected for use in this study. All of the tests were currently available 
from commercial publishers at the time of the study, and all were designed 
for use with junior- and senior-high school students. Only those items 
that ask questions about specific reading passages presented in the 
tests were used. Because in most of the tests more than one question is 
asked about each passage and because it was desirable to examine the 
possible interrelatedness of the items for a given passage, a sample of 
items was drawn indirectly by random selection of passages from each 
test. Passages were selected randomly from each test until at least 
five per cent of the total number of items were included in the sample. 
No more than ten per cent of the items from any given test were included 
in the total sample of items. A total of 49 items was selected. 
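
The passage-based sampling rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the passage labels, and the toy test are hypothetical, and the rule of skipping a passage that would push the sample past ten per cent is one plausible way to honor the upper bound, not a procedure documented in the study.

```python
import random


def sample_passages(passages, min_frac=0.05, max_frac=0.10, seed=None):
    """Randomly pick passages from one test until at least ``min_frac`` of
    the test's items are covered, skipping any passage that would push the
    sample past ``max_frac`` (the skipping rule is an assumption).

    ``passages`` maps a passage label to the number of items asked about it.
    Returns (selected passage labels, number of items sampled).
    """
    rng = random.Random(seed)
    total_items = sum(passages.values())
    order = list(passages)
    rng.shuffle(order)  # put the passages in random order

    selected, n_items = [], 0
    for label in order:
        if n_items >= min_frac * total_items:
            break  # lower bound reached, stop drawing passages
        if n_items + passages[label] > max_frac * total_items:
            continue  # this passage would exceed the upper bound
        selected.append(label)
        n_items += passages[label]
    return selected, n_items


# Hypothetical test: 12 passages with 5 items each (60 items in total).
toy_test = {f"passage_{i}": 5 for i in range(1, 13)}
chosen, count = sample_passages(toy_test, seed=1)
print(chosen, count)
```

Because whole passages are drawn, all items on a selected passage enter the sample together, which is what makes it possible to examine interrelated items. For some tests the two bounds may not be jointly attainable, in which case the loop simply ends below the lower bound.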



Analysis. Each item selected for use in this study was evaluated 
independently by two judges.² A special rating form was developed to aid 
the judges. The first part of the form asked the judges to indicate 
the skill they thought each item was designed to measure: (1) finding 
the answers to questions answered explicitly or in paraphrase, (2) drawing 
inferences or deductions, or (3) some "other" skill. 

The second part of the form presented the judges with seven potential 
item faults. These were: 

1. Inadequate keyed choice (i.e., the choice designated as 
"correct" is not thoroughly correct). 

2. Defensible distracter(s) (i.e., one or more "incorrect" choices 
can be defended as the correct choice). 

3. Information other than that provided in the passage is needed 
in order to identify the keyed choice. 

4. Question measures general knowledge (i.e., examinees may be 
able to answer on the basis of their knowledge without reading 
the associated passage). 

5. Item is related to another item on the same passage in such a 
way that the interrelationship may aid an examinee who has not 
carefully considered the passage. 

6. Distracters are not homogeneous with keyed choice (i.e., keyed 
choice is more general, longer, etc.). 

7. Other faults. 

² William R. Crawford, University of California, Los Angeles, and 
Mary B. Willis, American Institutes for Research, Palo Alto, served as 
the judges. 

Faults three, four, and five refer specifically to multiple-choice items 
designed to be passage-dependent. These faults have been discussed at 
length by Pyrczak (1972, 1973a). 

For each item, the judges were asked to indicate which faults, if any, 
were present. For each fault, furthermore, the judges were asked to 
indicate the extent to which the fault is detrimental to the item's ability 
to discriminate between those who do and those who do not have the 
reading skill in question by checking either "not detrimental," "moderately 
detrimental," or "seriously detrimental." A similar three-point rating 
scale previously has been used successfully in evaluating the quality 
of arithmetic-reasoning items (Pyrczak, in press). The judges also were 
asked to give a written explanation for each fault that they found. 
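
For readers who want to record ratings of this kind electronically, the two-part rating form can be represented as a small data structure. The sketch below is an assumption about how such a record might be laid out, not a reproduction of the form used in the study; all class, field, and identifier names are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Skill(Enum):
    EXPLICIT_OR_PARAPHRASE = 1  # answer stated explicitly or in paraphrase
    INFERENCE_OR_DEDUCTION = 2
    OTHER = 3


class Fault(Enum):
    INADEQUATE_KEYED_CHOICE = 1
    DEFENSIBLE_DISTRACTER = 2
    OUTSIDE_INFORMATION_NEEDED = 3
    GENERAL_KNOWLEDGE = 4
    INTERRELATED_ITEMS = 5
    NONHOMOGENEOUS_DISTRACTERS = 6
    OTHER = 7


class Severity(Enum):
    NOT_DETRIMENTAL = 1
    MODERATELY_DETRIMENTAL = 2
    SERIOUSLY_DETRIMENTAL = 3


@dataclass
class FaultRating:
    fault: Fault
    severity: Severity
    explanation: str  # the written explanation requested for each fault


@dataclass
class ItemRating:
    item_id: str
    judge: str
    skill: Optional[Skill] = None  # None if the judge left the skill blank
    faults: List[FaultRating] = field(default_factory=list)


# Hypothetical rating of one item by one judge.
example = ItemRating("item_01", "judge_A", Skill.INFERENCE_OR_DEDUCTION)
example.faults.append(
    FaultRating(
        Fault.GENERAL_KNOWLEDGE,
        Severity.MODERATELY_DETRIMENTAL,
        "Answerable without reading the passage.",
    )
)
print(example.item_id, len(example.faults))
```

Keeping the skill optional allows for a judge who does not classify an item, and storing one record per judge and item keeps the two sets of ratings independent until they are compared.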

RESULTS 

Skills measured. One judge classified 14 of the items as measures of 
the ability to find answers to questions answered explicitly or in 
paraphrase in the passages, 31 as measures of the ability to draw inferences 
or deductions, and 4 as measures of some "other" skill. The other judge 
classified 20, 26, and 2 items as belonging to these three skill areas, 
respectively. The second judge did not indicate the type of skill 
measured by one of the items. The two judges agreed on the classification 
of 31 of the 49 items (63 per cent). This is a fairly high rate of agreement 
considering the types of judgments involved. Pyrczak (1973b) has discussed 
some of the problems involved in classifying reading-comprehension items 
in terms of the skills they appear to measure. 
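
The arithmetic behind these classification figures can be checked with a short script. It uses only the counts reported above; the variable names are illustrative, and simple percentage agreement is the only index computed because the cross-classification of the two judges' ratings is not reported.

```python
TOTAL_ITEMS = 49

# Reported classifications by skill area.
judge_1 = {"explicit_or_paraphrase": 14, "inference_or_deduction": 31, "other": 4}
judge_2 = {"explicit_or_paraphrase": 20, "inference_or_deduction": 26, "other": 2}

AGREED = 31  # items given the same classification by both judges

# The first judge classified every item; the second left one unclassified.
assert sum(judge_1.values()) == TOTAL_ITEMS
unclassified_by_judge_2 = TOTAL_ITEMS - sum(judge_2.values())

percent_agreement = 100 * AGREED / TOTAL_ITEMS
inference_share_1 = 100 * judge_1["inference_or_deduction"] / TOTAL_ITEMS
inference_share_2 = 100 * judge_2["inference_or_deduction"] / TOTAL_ITEMS

print(f"Unclassified by second judge: {unclassified_by_judge_2}")
print(f"Agreement on skill classification: {percent_agreement:.1f}%")
print(f"Items rated as inference or deduction: "
      f"{inference_share_1:.1f}% and {inference_share_2:.1f}%")
```

Both shares of items rated as inference or deduction exceed one half, which is consistent with the statement that each judge placed a majority of the items in that category.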

Faults present. One judge found a total of 122 faults in the 49 items 
while the other judge found only 31 faults. Clearly, the two judges 
applied different standards when rating the items and had different types 
of insights into the content of the items and their relationships with 
the passages. Thus, by conventional standards there was a low rate of 
interobserver agreement. Table 1 indicates the number of times both 
judges agreed that a particular type of fault was present in a given 
item. It is interesting to note that both judges thought that seven 
items, to some extent, were measures of general knowledge. 



INSERT TABLE 1 ABOUT HERE 

Table 2 shows the number of faults found in the 49 items by each judge. 
It is interesting that each judge found each type of fault at least once. 



INSERT TABLE 2 ABOUT HERE 
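
The extent of overlap between the two judges' fault ratings can be summarized from the counts in Tables 1 and 2. The sketch below is an illustration, not an analysis reported in the study. It assumes that each judge recorded a given fault type at most once per item, so that every Table 1 entry counts one fault from each judge's total, and it reads the two illegible entries in Table 1 as zeros. The shortened fault labels and variable names are hypothetical.

```python
# Table 1: times both judges found the same fault in the same item.
both_judges = {
    "inadequate keyed choice": 0,
    "defensible distracter(s)": 3,
    "outside information needed": 2,
    "measures general knowledge": 7,
    "related to another item": 0,
    "distracters not homogeneous": 2,
    "other faults": 3,
}

# Table 2: faults found by the first judge, by fault type.
judge_1 = {
    "inadequate keyed choice": 18,
    "defensible distracter(s)": 25,
    "outside information needed": 16,
    "measures general knowledge": 22,
    "related to another item": 10,
    "distracters not homogeneous": 12,
    "other faults": 19,
}
JUDGE_2_TOTAL = 31  # only the total is available for the second judge

joint = sum(both_judges.values())
judge_1_total = sum(judge_1.values())
either = judge_1_total + JUDGE_2_TOTAL - joint  # faults found by at least one judge

print(f"Faults found by both judges: {joint}")
print(f"Share of first judge's faults confirmed: {100 * joint / judge_1_total:.1f}%")
print(f"Share of second judge's faults confirmed: {100 * joint / JUDGE_2_TOTAL:.1f}%")
print(f"Joint faults as a share of all faults found: {100 * joint / either:.1f}%")
```

Under these assumptions only a small fraction of the faults found by either judge was found by both, which is the sense in which agreement on faults was low.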



A major weakness of the present study was the low rate of agreement 
between the judges on the presence or absence of faults in the items. 
While the rate of agreement was disappointingly low, it was not 
especially surprising considering the subtle factors involved in the types 
of judgments that the experts were asked to make. It is interesting 
to note that as part of a larger study Pyrczak (in press) had 
arithmetic-reasoning items rated for quality by three judges using a check list 
similar to that used in this study and obtained fairly consistent ratings. 
Thus, it may be that making judgments of the quality of arithmetic items 
is a more clear-cut process than making judgments of the quality of 
reading-comprehension items. Clearly, further investigation is needed 
to determine if procedures can be developed for obtaining consistent, 
independent judgments of the quality of reading-comprehension items. 
Such procedures would be very helpful when editing and revising 
reading-comprehension items during test construction. 

Because of the limitations of the rating process, it is difficult 
to draw an overall generalization regarding the extent to which faults 
are present in standardized reading-comprehension tests. It seems 
reasonable to conclude, however, that a majority of the items will be 
subject to some type of criticism if carefully examined by experts. 

The fault on which the judges most often agreed was a lack of 
passage-dependence due to items measuring general knowledge. Pyrczak (1972) 
suggested an empirical method of identifying items with this fault. 
Specifically, he suggested administering reading-comprehension questions 
in the absence of the associated passages and asking examinees to indicate 
the basis or bases for their responses. 
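
One way the quantitative side of such a no-passage administration might be screened is sketched below. The comparison with the chance level for a multiple-choice item and the exact binomial test are assumptions made for illustration; they are not described in this paper, and the sketch does not cover the examinees' reports of the bases for their responses. The function names, item labels, and counts are hypothetical.

```python
from math import comb


def prob_at_least(k, n, p):
    """Exact binomial probability of k or more successes in n trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def flag_items(no_passage_correct, n_examinees, n_choices, alpha=0.05):
    """Return items answered correctly without the passage more often than
    guessing alone would plausibly produce (one-sided exact binomial test)."""
    chance = 1 / n_choices
    flagged = {}
    for item, n_correct in no_passage_correct.items():
        p_value = prob_at_least(n_correct, n_examinees, chance)
        if p_value < alpha:
            flagged[item] = p_value
    return flagged


# Hypothetical data: 40 examinees answer four-choice items with no passage.
toy_counts = {"item_a": 11, "item_b": 25}
flagged = flag_items(toy_counts, n_examinees=40, n_choices=4)
print(sorted(flagged))
```

Items flagged in this way would be candidates for the general-knowledge fault, to be confirmed by inspecting the reasons examinees give for their answers.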

In conclusion, a majority of the items in reading-comprehension tests 
appear to be measures of the ability to draw inferences and deductions 
from reading material, and a majority appear to be subject to some type 
of criticism when critically examined by experts. Obtaining agreement 
among experts on the number and nature of the faults in a given item 
when it is examined independently by them appears to be difficult and 
is a topic that deserves further investigation. 
Table 1: Number of times both judges found each fault in a given item. 

                                          Number of times 
                                          both judges found the 
Fault                                     fault in an item 

Inadequate keyed choice                            0 
Defensible distracter(s)                           3 
Information other than that provided 
  in the passage is needed in order 
  to identify the keyed choice                     2 
Question measures general knowledge                7 
Item is related to another item on 
  the same passage in such a way that 
  the interrelationship may aid an 
  examinee who has not carefully 
  considered the passage                           0 
Distracters are not homogeneous with 
  keyed choice                                     2 
Other faults                                       3 






Table 2: Number of faults found in the items by each judge. 

Fault                                     Judge 1    Judge 2 

Inadequate keyed choice                      18 
   Not detrimental                            9 
   Moderately detrimental                     5 
   Seriously detrimental                      4 

Defensible distracter(s)                     25 
   Not detrimental 
   Moderately detrimental 
   Seriously detrimental 

Information other than that provided 
  in the passage is needed in order 
  to identify the keyed choice               16 
   Not detrimental 
   Moderately detrimental 
   Seriously detrimental 

Question measures general knowledge          22 
   Not detrimental 
   Moderately detrimental 
   Seriously detrimental 

Item is related to another item on 
  the same passage in such a way that 
  the interrelationship may aid an 
  examinee who has not carefully 
  considered the passage                     10 
   Not detrimental 
   Moderately detrimental 
   Seriously detrimental 

Distracters are not homogeneous 
  with keyed choice                          12 
   Not detrimental 
   Moderately detrimental 
   Seriously detrimental 

Other faults                                 19 
   Not detrimental 
   Moderately detrimental 
   Seriously detrimental 

Total                                       122         31 



